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Abstract. As multicore systems continue to gain ground in the High 
Performance Computing world, linear algebra algorithms have to be re- 
formulated or new algorithms have to be developed in order to take ad- 
vantage of the architectural features on these new processors. Fine grain 
parallelism becomes a major requirement and introduces the necessity of 
loose synchronization in the parallel execution of an operation. This pa- 
per presents an algorithm for the QR factorization where the operations 
can be represented as a sequence of small tasks that operate on square 
blocks of data. These tasks can be dynamically scheduled for execution 
based on the dependencies among them and on the availability of com- 
putational resources. This may result in an out of order execution of the 
tasks which will completely hide the presence of intrinsically sequential 
tasks in the factorization. Performance comparisons are presented with 
the LAPACK algorithm for QR factorization where parallelism can only 
be exploited at the level of the BLAS operations. 



1 Introduction 

In the last twenty years, microprocessor manufacturers have been driven to- 
wards higher performance rates only by the exploitation of higher degrees of 
Instruction Level Parallelism (ILP). Based on this approach, several generations 
of processors have been built where clock frequencies were higher and higher 
and pipelines were deeper and deeper. As a result, applications could benefit 
from these innovations and achieve higher performance simply by relying on 
compilers that could efficiently exploit ILP. Due to a number of physical limi- 
tations (mostly power consumption and heat dissipation) this approach cannot 
be pushed any further. For this reason, chip designers have moved their focus 
from ILP to Thread Level Parallelism (TLP) where higher performance can be 
achieved by replicating execution units (or cores) on the die while keeping the 
clock rates in a range where power consumption and heat dissipation do not rep- 
resent a problem. Multicore processors clearly represent the future of computing. 
It is easy to imagine that multicore technologies will have a deep impact on the 
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High Performance Computing (HPC) world where high processor counts are in- 
volved and, thus, limiting power consumption and heat dissipation is a major 
requirement. The Top500 [1] list released in June 2007 shows that the number of 
dual-core Intel Woodcrest processors grew in six months (i.e. from the previous 
list) from 31 to 205 and that 90 more systems are based on dual-core AMD 
Opteron processors. 

Even if many attempts have been made in the past to develop parallelizing 
compilers, they proved themselves efficient only on a restricted class of problems. 
As a result, at this stage of the multicore era, programmers cannot rely on com- 
pilers to take advantage of the multiple execution units present on a processor. 
All the applications that were not explicitly coded to be run on parallel archi- 
tectures must be rewritten with parallelism in mind. Also, those applications 
that could exploit parallelism may need considerable rework in order to take 
advantage of the fine-grain parallelism features provided by multicores. 

The current set of multicore chips from Intel and AMD are for the most part 
multiple processors glued together on the same chip. There are many scalability 
issues to this approach and it is unlikely that type of architecture will scale up 
beyond 8 or 16 cores. Even though it is not yet clear how chip designers are going 
to address these issues, it is possible to identify some properties that algorithms 
must have in order to match high degrees of TLP: 

fine granularity: cores are (and probably will be) associated with relatively 
small local memories (either caches or explicitly managed memories like in 
the case of the STI Cell [20] architecture or the Intel Polaris [3] prototype). 
This requires splitting an operation into tasks that operate on small portions 
of data in order to reduce bus traffic and improve data locality. 

asynchronicity: as the degree of TLP grows and granularity of the operations 
becomes smaller, the presence of synchronization points in a parallel execu- 
tion seriously affects the efficiency of an algorithm. 

The LAPACK [5] and ScaLAPACK [9] software libraries represent a de facto 
standard for high performance dense Linear Algebra computations and have 
been developed, respectively, for shared-memory and distributed-memory archi- 
tectures. In both cases exploitation of parallelism comes from the availability of 
parallel BLAS. In the LAPACK Ccl.SC. M number of BLAS libraries can be used 
to take advantage of multiple processing units on shared memory systems; for 
example, the freely distributed ATLAS [25] and GotoBLAS [14] or other ven- 
dor BLAS like Intel MKL [2] are popular choices. These parallel BLAS libraries 
use common techniques for shared memory parallelization like pThreads [19] or 
OpcnMP [11]. This is represented in Figure 1 (left). 

In the ScaLAPACK case, parallelism is exploited by PBLAS [10] which is 
a parallel BLAS implementation that uses the Message Passing Interface [12] 
(MPI) for communications on a distributed memory system. Substantially, both 
LAPACK and ScaLAPACK implement sequential algorithms that rely on par- 
allel building blocks (i.e., the BLAS operations). As multicore systems require 
finer granularity and higher asynchronicity, considerable advantages may be ob- 
tained by reformulating old algorithms or developing new algorithms in a way 
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Fig. 1. Transition from sequential algorithms that rely on parallel BLAS to parallel 
algorithms. 



that their implementation can be easily mapped on these new architectures. This 
transition is shown in Figure 1. An approach along these lines has already been 
proposed in [7, 8, 18] where operations in the standard LAPACK algorithms for 
some common factorizations were broken into sequences of smaller tasks in order 
to achieve finer granularity and higher flexibility in the scheduling of tasks to 
cores. The importance of fine granularity algorithms is also shown in [17]. 

The rest of this document shows how this can be achieved for the QR fac- 
torization. Section 2 describes the algorithm for block QR factorization used in 
the LAPACK library; Section 3 describes the tiled QR factorization that pro- 
vides both fine granularity and high level of asynchronicity; performance results 
for this algorithm arc shown in Section 4. Finally future working directions are 
illustrated in Section 5. 



2 Block QR Factorization 

2.1 Description of the block QR Factorization 

The QR factorization is a transformation that factorizes an m x n matrix A 
into its factors Q and R where Q is a unitary matrix of size n x n and R is 
a triangular matrix of size m x m. This factorization is operated by applying 
min(m, n) Householder reflections to the matrix A. Since Householder reflections 
are orthogonal transformations, this factorization is stable as opposed to the LU 
one; however, stability comes at the price of a higher flop count: QR requires 
2n 2 (m— n/3) as opposed to the n 2 (m—n/3) needed for LU. A detailed discussion 
of the QR factorization can be found in [13, 23, 24]. LAPACK uses a particular 
version of this algorithm which achieves higher performance on architectures with 
memory hierarchies thanks to blocking. This algorithm is based on accumulating 
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a number of Householder transformations in what is called a panel factorization 
which are, then, applied all at once by means of high performance Level 3 BLAS 
operations. The technique used to accumulate Householder transformation was 
introduced in [22] and is known as the compact WY technique. 

The LAPACK subroutine that performs the QR factorization is called DGEQRF 
and is explained below. Consider a matrix A of size mxn that can be represented 
as 

'An A 12 
A 2 i A 22 

where An is of size 6x6, Ai 2 of size 6 x (n — b), A 2 \ of size (m — b)xb and A 22 
of size (m — b) x (n — b). 

The LAPACK algorithm for QR factorization can be described as a sequence 
of steps where, at each step, the transformation in Equation (1) is performed. 



.^21 -422 J \V 21 J ' \ A 22/ 
The transformation in Equation (1) is obtained in two steps: 

1. Panel Factorization. At this step a QR transformation of the panel 
is performed as in Equation (2). 

This operation produces b Householder reflectors (K,i) and an upper trian- 
gular matrix i?n of size 6x6, which is a portion of the final R factor, by 
means of the DGEQR2 LAPACK subroutine; also, at this step, a triangular 
matrix Tn of size 6 x 6 by means of the DLARFT LAPACK subroutine 4 . Please 
note that Vn is a unit lower triangular matrix of size 6x6. The arrays Vii 
and i?n do not need extra space to be stored since they overwrite A^. A 
temporary workspace is needed to store T\\. 

2. Trailing submatrix update. At this step, the transformation that was 
computed in the panel factorization is applied to the rest of the matrix, also 
called trailing submatrix as shown in Equation (3). 



R ™\ = ( r _(Vii\-(T 11 y(V 1 T 1 V 2 T 1 ) \(A l2 
A 22 { \V 21 { A 22 



(3) 



This operation, performed by means of the DLARFB LAPACK subroutine, 
produces a portion R\ 2 of the final R factor of size 6 x (n—b) and the matrix 

^22- 

The QR factorization is continued by applying the transformation (1) to the sub- 
matrix A 22 and, then, iteratively, until the end of the matrix A is reached. The 
value of 6 <C m, n (the so called block size) is set by default to 32 in LAPACK- 
3.1.1 but different values may be more appropriate, and provide higher perfor- 
mance, depending on the architecture characteristics. 



for the meaning of the matrix Tn please refer to [22] 
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2.2 Scalability limits of the LAPACK implementation 

The LAPACK algorithm for QR factorization can use any flavor of parallel 
BLAS to exploit parallelism on a multicore, shared-memory architecture. This 
approach, however, has a number of limitations due to the nature of the trans- 
formation in Equation (2), i.e., the panel factorization. Both the DGEQR2 and 
the DLARFT are rich in Level 2 BLAS operations that cannot be efficiently paral- 
lelized on currently available shared memory machines. To understand this, it is 
important to note that Level 2 BLAS operations can be, generally speaking, de- 
fined as all those operations where 0(n 2 ) floating-point operations are performed 
on 0(n 2 ) floating-point data; thus, the speed of Level 2 BLAS computations is 
limited by the speed at which the memory bus can feed the cores. On current 
multicores architectures, there is a vast disproportion between the bus band- 
width and the speed of the cores. For example the Intel Clovertown processor is 
equipped with four cores each capable of a double precision peak performance of 
10.64 GFlop/s (that is to say a peak of 42.56 GFlop/s for four cores) while the 
bus bandwidth peak is 10.64 GB/s which provides 1.33 GWords/s (a word being 
a 64 bit double precision number). As a result, since one core is largely enough 
to saturate the bus, using two or more cores does not provide any significant 
benefit. The LAPACK algorithm for QR factorization is, thus, characterized by 
the presence of a sequential operation (i.e., the panel factorization) which repre- 
sents a small fraction of the total number of FLOPS performed (0(n 2 ) FLOPS 
for a total of 0(n 3 ) FLOPS) but limits the scalability of the block QR factoriza- 
tion on a multicore system when parallelism is only exploited at the level of the 
BLAS routines. This approach will be referred to as the fork-join approach since 
the execution flow of the QR factorization would show a sequence of sequential 
operations (i.e. the panel factorizations) interleaved to parallel ones (i.e., the 
trailing submatrix updates). 





Intel MKL-9.1 


GotoBLAS-1.15 


# cores 


DGEQR2 


DGEQRF 


DGEQR2 


DGEQRF 




Gflop/s 


Gflop/s 


Gflop/s 


Gflop/s 


1 


0.4106 


2.9 


0.4549 


3.31 


2 


0.4105 


4.95 


0.4558 


5.51 


4 


0.4105 


8.79 


0.4557 


9.69 


8 


0.4109 


14.3 


0.4549 


10.58 


16 


0.4103 


16.89 


0.4558 


13.01 



Table 1. Scalability of the fork-join parallclization on a 8-way dual Opteron system 
(sixteen cores total). 



Table 1 shows the scalability limits of the panel factorization and how this 
affects the scalability of the whole QR factorization on an 8-way dual-core AMD 
Opteron system with MKL-9.1 and GotoBLAS-1.15 parallel BLAS libraries. 
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In [7, 18], a solution to this scalability problem is presented. The approach 
described in [7, 18] consists of breaking the trailing submatrix update into smaller 
tasks that operate on a block-column (i.e., a set of b contiguous columns where 
b is the block size). The algorithm can then be represented as a Directed Acyclic 
Graph (DAG) where nodes represent tasks, either panel factorization or update 
of a block-column, and edges represent dependencies among them. The execution 
of the algorithm is performed by asynchronously scheduling the tasks in a way 
that dependencies arc not violated. This asynchronous scheduling results in an 
out-of-order execution where slow, sequential tasks are hidden behind parallel 
ones. This approach can be described as a dynamic lookahead technique. Even 
if this approach provides significant speedup, as shown in [18], it is exposed 
to scalability problems. Due to the relatively high granularity of the tasks, the 
scheduling of tasks may have a limited flexibility and the parallel execution of 
the algorithm may be affected by an unbalanced load. 

The following sections describe the application of this idea of dynamic schedul- 
ing and out of order execution to an algorithm for QR factorization where finer 
granularity of the operations and higher flexibility for the scheduling can be 
achieved. 

3 Tiled QR Factorization 

The idea of dynamic scheduling and out of order execution can be applied to 
a class of algorithms that allow the parallelization of common Linear Algebra 
operations. Previous work in this direction includes SYRK, CHOL, block LU, 
and block QR [7, 8, 18]. For those four factorizations, no algorithmic change 
is needed however, while CHOL and SYRK can be naturally "tiled", the algo- 
rithms for block LU and block QR factorizations involved tall and skinny panel 
factorization that represents the bottlenecks of the computation (see pervious 
Section). In order to have a finer granularity in LU and QR, we need to "tile" 
the operations. To do so we will need a major algorithmic change in LU and QR. 

The algorithmic change we propose is actually well-known and takes its roots 
in updating factorizations [13, 23]. Using updating techniques to tile the algo- 
rithms have first 5 been proposed by Yip [26] for LU to improve the efficiency of 
out-of-core solvers, and were recently reintroduced in [15, 21] for LU and QR, 
once more in the out-of-core context. A similar idea has also been proposed in [6] 
for Hcsscnbcrg reduction in the parallel distributed context. 

The originality of this paper is to study this techniques in the multicorc 
context, where they enable us to schedule operations of very fine granularity. 

3.1 A Fine-Grain Algorithm for QR Factorization 

The QR factorization by blocks will be constructed based on the following four 
elementary operations: 

5 to our knowledge 
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DGEQT2. This subroutine was developed to perform the unblocked factorization 
of a diagonal block Akk of size b x b. This operation produces an upper 
triangular matrix R k k, a unit lower triangular matrix V k k that contains b 
Householder reflectors and an upper triangular matrix Tkk as defined by the 
WY technique for accumulating the transformations. Note that both Rkk 
and Vkk can be written on the memory area that was used for Akk and, 
thus, no extra storage is needed for them. A temporary work space is needed 
to store T k k- 

Thus, DGEQT2(yl fcfc , T kk ) performs 

Akk < — Vkk, Rkk Tkk < — Tkk 

DLARFB. This LAPACK subroutine will be used to apply the transformation 
{Vkk, Tkk) computed by subroutine DGEQT2 to a block Akj- 
Thus, DLARFB (Afcj, V k k, T kk ) performs 

Akj < — {I - VkkTkkVjF^Akj 

DTSQT2. This subroutine was developed to perform the unblocked QR factoriza- 
tion of a matrix that is formed by coupling an upper triangular block Rkk 
with a square block Aik- This subroutine will return an upper triangular 
matrix Rkk which will overwrite Rkk and b Householder reflectors where b is 
the block size. Note that, since Rkk is upper triangular, the resulting House- 
holder reflectors can be represented as an identity block I on top of a square 
block Vik- For this reason no extra storage is needed for the Householder vec- 
tors since the identity block need not be stored and Vik can overwrite A ik - 
Also a matrix is produced for which storage space has to be allocated. 
Then, DTSQT2(i? fefe , A ik , T lk ) performs 



Rkk\ < ( I 

Aik J V Vik 



, Rkk 



t ik < -L ik 



DSSRFB. This subroutine was developed to apply the transformation computed 
by DTSQT2 to a matrix formed coupling two square blocks Akj and Aij. 
Thus, DSSRF(Ajy, A^, V ik , T tk ) performs 

A kj \ ( j_( I\ iT ik y(IVT) \ (A k0 



An I \ \ Vik 1 I \ A 

All of this elementary operations rely on BLAS subroutines to perform in- 
ternal computations. 

Assuming a matrix A of size pb x qb 

(An A 12 ... A lq \ 

A2I A22 ■ ■ ■ A^ q 



\ A p i A p 2 . . . A pq J 
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Algorithm 1 The block algorithm for QR factorization. 

1: for k = 1, 2..., min(p, q) do 

2: DGEQT2(A fcfc , T kk ); 

3: for j = k + 1, k + 2, q do 

4: DLARFB(A fcj , V fcfc , T fefc ); 

5: end for 

6: for i = k + 1, fc + 1, ...,p do 

7: DTSQT2(_Rfcfe, T lfc ); 

8: for j = fc + 1, k + 2, ...,q do 

9: DSSRFB(yl fc j, A;,-, V ik , T ik ); 

10: end for 

11: end for 

12: end for 



where b is the block size and each is of size b x b, the QR factorization can 
be performed as in Algorithm 1. 

Figure 2 gives a graphical representation of one repetition (with k = 1) of 
the outer loop in Algorithm 1 with p = q = 3. The red, thick borders show what 
blocks in the matrix are being read and the light blue fill shows what blocks are 
being written in a step. The Tkk matrices are not shown in this figure for clarity 
purposes. 

3.2 Operation count 

This section shows that Algorithm 1 has a higher operation count than the LA- 
PACK algorithm discussed in Section 2. Performance results in Section 4 will 
demonstrate that it is worth paying this cost for the sake of scaling. The oper- 
ation count of the block algorithm for QR factorization can be derived starting 
from the operation count of each elementary operation; assuming that b is the 
block size: 

DGEQT2: this operation is a standard non blocked QR factorization of a b x b 
matrix where, in addition, the T kk triangular matrix is computed. Thus, 
4/36 3 floating point operations are performed for the factorization plus 2 /3b 3 
for computing T k k- This subroutine accounts for 26 3 floating point operations 
total. 

DLARFB: since both Vtk and Tkk are triangular matrices, 36 3 floating point op- 
erations are done in this subroutine. 

DTSQT2: taking advantage of the triangular structure of Rkk, the factorization 
can be computed in 26 3 floating point operations. The computation of the 
triangular matrix can also be performed exploiting the structure of the 
Householder vectors built during the factorization (remember that the b 
reflectors can be represented as an identity block on top of a square full 
block). Since 4/36 3 arc needed to compute T^, the DTSQT2 accounts for 
10/36 3 floating point operations. 
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Fig. 2. Graphical representation of one repetition of the outer loop in Algorithm 1 on 
a matrix with p — q = 3. As expected the picture is very similar to the out-of-core 
algorithm presented in [15]. 



DSSRFB: exploiting the structure of the Householder reflectors and of the 
matrix computed in DTSQT2, this subroutine needs 5b 3 floating point opera- 
tions. 

For each repetition of the outer loop in Algorithm 1, one DGEQT2, q — k 
DLARFB, p — k DTSQT2 and (p — k)(q - k) DSSRFB are performed for a total of 
2b 3 + 3(q - k)b 3 + 10/3(p - k)b 3 + 5(p - k)(q - k)b 3 . Assuming, without loss of 
generality, that q < p and integrating this quantity over all the q repetitions of 
the outer loop in Algorithm 1, the total operation count for the QR factorization 
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is 

ELi( 2&3 + % - k)b 3 + f(p - k)b 3 + 5(p - k)(q - k)b 3 ) 
^h 2 (p-l)b 3 (4) 
= §n 2 (m-f). 

Equation (4) shows that the block algorithm for QR factorization needs 25% 
more floating point operations than the standard LAPACK algorithm. 

Note that if we have used the nonblocked version of the Householder appli- 
cation (DLARF instead of DLARFB) then the number of FLOPS for the tiled 
algorithm and the block algorithm would have been exactly the same. 

An easy twist to reduce the 25% overhead of the block version is to use 
nonsquare blocks. For example, using 2b x b blocks would reduce the overhead 
at 12.5%. 



3.3 Graph driven asynchronous execution 

Following the approach presented in [7, 18], Algorithm 1 can be represented as a 
Directed Acyclic Graph (DAG) where nodes are elementary tasks that operate 
on b x b blocks and where edges represent the dependencies among them. Figure 3 
show the DAG when Algorithm 1 is executed on a matrix with p = q = 3. Note 
that the DAG has a recursive structure and, thus, if p\ > P2 and qi > q 2 then 
the DAG for a matrix of size p 2 x q 2 is a subgraph of the DAG for a matrix of 
size p\ x q\. This property also holds for most of the algorithms in LAPACK. 

Once the DAG is known, the tasks can be scheduled asynchronously and 
independently as long as the dependencies are not violated. A critical path can 
be identified in the DAG as the path that connects all the nodes that have the 
higher number of outgoing edges. Based on this observation, a scheduling policy 
can be used, where higher priority is assigned to those nodes that lie on the 
critical path. Clearly, in the case of our block algorithm for QR factorization, 
the nodes associated to the DGEQT2 subroutine have the highest priority and 
then three other priority levels can be defined for DTSQT2, DLARFB and DSSRFB 
in descending order. 

This dynamic scheduling results in an out of order execution where idle time 
is almost completely eliminated since only very loose synchronization is required 
between the threads. Figure 4 shows part of the execution flow of Algorithm 1 on 
a 16-cores machine (8-way Dual Opteron) when tasks are dynamically scheduled 
based on dependencies in the DAG. Each line in the execution flow shows which 
tasks are performed by one of the threads involved in the factorization. 

Figure 4 shows that all the idle times, which represent the major scalability 
limit of the fork-join approach, can be removed thanks to the very low synchro- 
nization requirements of the graph driven execution. The graph driven execution 
also provides some degree of adaptivity since tasks are scheduled to threads de- 
pending on the availability of execution units. 
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Fig. 3. The dependency graph of Algorithm 1 on a matrix with p — q — 3. 
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Fig. 4. The execution flow for dynamic scheduling, out of order execution of Algo- 
rithm 1. 



3.4 Block Data Layout 



The major limitation of performing very fine grain computations, is that the 
BLAS library generally have very poor performance on small blocks. This situ- 
ation can be considerably improved by storing matrices in Block Data Layout 
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(BDL) instead of the Column Major Format that is the standard storage format 
for FORTRAN arrays. 



"H" 



' ' * ' 



t T T t 



Fig. 5. A comparison of Column Major storage format (left) and Block Data Layout 
(right). 



Figure 5 compares Column Major Format (left) and Block Data Layout 
(right). In BDL a matrix is split into blocks and each block is stored into contigu- 
ous memory locations. Each block is stored in Column Major Format and blocks 
are stored in Column Major Format with respect to each other. As a result the 
access pattern to memory is more regular and BLAS performance is consider- 
ably improved. The benefits of BDL have been extensively studied in the past, 
for example in [16], and recent studies like [8] demonstrate how fine-granularity 
parallel algorithms can benefit from BDL. 

4 Performance Results 

The performance of the tiled QR factorization with dynamic scheduling of tasks 
has been measured on the systems listed in Table 2 and compared to the per- 
formance of the fork-join approach, i.e., the standard algorithm for block QR 
factorization of LAPACK associated with multithreaded BLAS. 

Figures 6, 7, 8, 9 report the performance of the QR factorization for both 
the block algorithm with dynamic scheduling and the LAPACK algorithm with 
multithreaded BLAS. A block size of 200 has been used for the block algorithm 
while the block size for the LAPACK algorithm 6 has been tuned in order to 
achieve the best performance for all the combinations of architecture and BLAS 
library. 

In each graph, two curves are reported for the block algorithm with dynamic 
scheduling; the solid curve shows its relative performance when the operation 
count is assumed equal to the one of the LAPACK algorithm reported in Sec- 
tion 2 while the dashed curve shows its "raw" performance, i.e. the actual flop 



the block size in the LAPACK algorithm sets the width of the panel. 
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8-way dual Opteron 


2-way quad Clovertown 


Architecture 


Dual-Core AMD 


Intel®Xeon®CPU 




Opteron™8214 


X5355 


Clock speed 


2.2 GHz 


2.66 GHz 


# cores 


8 x 2 = 16 


2x4 = 8 


Peak performance 


70.4 Gflop/s 


85.12 Gflop/s 


Memory 


62 GB 


16 GB 


Compiler suite 


Intel 9.1 


Intel 9.1 


BLAS libraries 


GotoBLAS-1.15 


GotoBLAS-1.15 




MKL-9.1 


MKL-9.1 



Table 2. Details of the systems used for the following performance results. 



rate computed with the exact operation count for this algorithm (given in Equa- 
tion (4)). As already mentioned, the "raw performance" (dashed curve) is 25% 
higher than the relative performance (solid curve). 

The graphs on the left part of each figure show the performance measured 
using the maximum number of cores available on each system with respect to 
the problem size. The graphs on the right part of each figure show the weak 
scalability, i.e. the flop rates versus the number of cores when the local problem 
size is kept constant (nloc=5,000) as the number of cores increases. 



QR — 8-way dual Opteron, GotoBLAS-1 .15 Q R — 8-way dual Opteron, GotoBLAS-1.15 




problem size ~ # of processors 

Fig. 6. Comparison between the performance of the block algorithm with dynamic 
scheduling using GotoBLAS-1.15 on an 8-way dual Opteron system. The dashed curve 
reports the raw performance of the block algorithm with dynamic scheduling, i.e., the 
performance as computed with the true operation count in Equation (4). 



Figures 6, 7, 8, 9 show that, despite the higher operation count, the block 
algorithm with dynamic scheduling is capable of completing the QR factorization 
in less time than the LAPACK algorithm when the parallelism degree is high 
enough that the benefits of the asynchronous execution overcome the penalty 
of the extra flops. For lower numbers of cores, in fact, the fork-join approach 
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Fig. 7. Comparison between the performance of the block algorithm with dynamic 
scheduling using GotoBLAS-1.15 on an 2-way quad Clovertown system. The dashed 
curve reports the raw performance of the block algorithm with dynamic scheduling, 
i.e., the performance as computed with the true operation count in Equation (4). 




2000 4000 6000 8000 10000 12000 14000 
problem size 



# of processors 



Fig. 8. Comparison between the performance of the block algorithm with dynamic 
scheduling using MKL-9.1 on an 8-way dual Opteron system. The dashed curve re- 
ports the raw performance of the block algorithm with dynamic scheduling, i.e., the 
performance as computed with the true operation count in Equation (4). 
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Fig. 9. Comparison between the performance of the block algorithm with dynamic 
scheduling using MKL-9.1 on an 2-way quad Clovertown system. The dashed curve 
reports the raw performance of the block algorithm with dynamic scheduling, i.e., the 
performance as computed with the true operation count in Equation (4). 

has a good scalability and completes the QR factorization in less time than the 
block algorithm because of the lower flop count. Note that the actual execution 
rate of the block algorithm for QR factorization with dynamic scheduling (i.e., 
the dashed curves) is always higher than that of the LAPACK algorithm with 
multithreaded BLAS even for low numbers of cores. The actual performance of 
the block algorithm, even if considerably higher than that of the fork-join one, is 
still far from the peak performance of the systems used for the measures. This is 
mostly due to two factors. First the nature of the BLAS operations involved; the 
DGEQR2 and the DLARFT in the LAPACK algorithm and the DGEQT2 and DTSQT2 in 
the block algorithm are based on Level 2 BLAS operations that, being memory 
bound, represent a limit for performance. Second, the performance of BLAS 
routines on small size blocks. The block size used in the experiments reported 
above is 200; this block size represents a good compromise between flexibility of 
the scheduler and performance of the BLAS operations but it is far from being 
ideal. Such a block size, in fact, does not allow a good task scheduling for smaller 
size problems and still the performance of BLAS operations is far from what can 
be achieved for bigger size blocks. 

5 Conclusion 

By adapting known algorithms for updating the QR factorization of a matrix, 
we have derived an implementation scheme of the QR factorization for multicore 
architectures based on dynamic scheduling and block data layout. Although the 
proposed algorithm is performing 25% more FLOPS than the regular algorithm, 
the gain in flexibility allows an efficient dynamic scheduling which enables the 
algorithm to scale almost perfectly when the number of cores increases. 
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We note that the 25% overhead can be easily reduced by using nonsquare 
blocks. For example, using 2b x b blocks, the overhead reduces to 12.5%. 

While this paper only addresses the QR factorization, it is straightforward 
to derive with the same ideas the two important computational routines that 
consists in applying the Q factor to a set of vectors (see DORMQR in LAPACK) 
and constructing the Q-factor (see DORGQR in LAPACK). 
The ideas behind this work can be extended in many directions: 

Implement other linear algebra operations. The LU factorization can be 
performed with an algorithm that is analogous to the QR one described in 
Section 3. This algorithm has been discussed in [21, 26] as a way of improving 
the out-of-core LU factorization. Even though the only difference between 
the block algorithms for the LU and QR factorizations is in the elementary 
operations, in the LU case the cost of block algorithm is 50% higher than the 
LAPACK algorithm. For this reason, the benefits of the improved scalability 
may be visible only at very high processor counts or may not be visible at 
all. Techniques must be investigated to eliminate or reduce the extra cost. 
The same to blocking may also be applied to other two sided transforma- 
tions like Hcsscnberg reduction, Tridiagonalization and Bidiagonalization. In 
these transformations, Level 2 BLAS operations are predominant and panel 
reductions account for almost 50% of the time of a sequential execution. 
Breaking the panel into smaller tasks that can be executed in parallel with 
other tasks may yield considerable performance improvements. 

Enforcing data locality. The results proposed in [8] show that enforcing data 
locality and CPU affinity may provide considerable benefits. It must be noted 
that the improvements that can be expected on non-multicore SMPs are 
higher than on currently available multicore systems and this is due to the 
fact that on multicores, some of the higher level memories are shared between 
multiple cores. Moreover enforcing data locality has a major drawback in the 
fact that it seriously limits the scheduling of tasks since each core can only 
be assigned tasks that operate on data that resides on the memory associ- 
ated with it. Preliminary results show that enforcing data locality and CPU 
affinity provides a slight speedup on the 8-way Dual Optcron system which 
is a NUMA architecture. These techniques require further investigation. 

Implement the same algorithms in distributed memory systems. The fact 
that the block algorithms for QR and LU factorizations only require loose 
synchronization between tasks make them also good candidates for the im- 
plementation on distributed memory systems based on MPI communications. 

Implement the same algorithms on the STI Cell architecture. In the STI 
Cell processor, no caches are present but a small, explicitly managed memory 
is associated to each core. Due to the small size of these local memories (only 
256 KB), the LAPACK algorithms for LU and QR factorizations cannot be 
efficiently implemented. The block algorithms for LU and QR factorizations 
represent ideal candidates for the STI Cell architecture since they can be 
parallelized with a very fine granularity. 

Explore the usage parallel programming environments. The task of im- 
plementing Linear Algebra operations with dynamic scheduling of tasks on 
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multicore architectures can be considerably simplified by the use of graph 
driven parallel programming environments. One such environment is SMP 
Superscalar [4] developed at the Barcelona Supercomputing Center. SMP Su- 
perscalar addresses the automatic exploitation of the functional parallelism 
of a sequential program in multicore and SMP environments. The focus in on 
the portability, simplicity and flexibility of the programming model. Based 
on a simple annotation of the source code, a source to source compiler gener- 
ates the necessary code and a runtime library exploits the existing parallelism 
by building at runtime a task dependency graph. The runtime takes care of 
scheduling the tasks and handling the associated data. Besides, a temporal 
locality driven task scheduling can be implemented. 
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